Dataset construction



A Few-shot MiniImageNet: the dataset construction is based on MiniImageNet [26], following the method of Tsimpoukelli et al.

Neural Information Processing Systems

A 256×256 image size is used so that the ViT encoder generates 256 tokens. We follow the process used in Tsimpoukelli et al.
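The token count follows directly from ViT patch arithmetic. A minimal sketch, assuming the common 16×16 patch size (the snippet does not state the patch size, so that value is an assumption):

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    per_side = image_size // patch_size
    return per_side * per_side

# A 256x256 image with 16x16 patches yields 16 * 16 = 256 tokens.
print(vit_token_count(256, 16))  # 256
```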


Appendix A.1 Datasheet for UrbanKG Dataset

Neural Information Processing Systems

We have presented the UrbanKG dataset construction process in Section 2. We implement all models using PyTorch, and all experiments are conducted on eight NVIDIA RTX 3090 GPUs. Human Mobility: we construct the human mobility dataset from taxi service and bike trip data; for example, [2020/4/1/4:20, (40.68, -74.01), 2020/4/1/4:26, (40.68, -73.99)] is one such record. The inflow and outflow of a POI can be calculated by counting the numbers of taxi passengers, taxi drivers, and bikers who enter and leave within a period of time. For NYC mobility prediction, we calculate the inflow and outflow at each POI at 30-minute intervals from April 1st to June 30th, 2020.
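The inflow/outflow computation described above can be sketched as a simple binning pass over trip records. This is an illustrative assumption of the record layout (start time, start POI, end time, end POI), mirroring the snippet's [time, location, time, location] format; mapping raw coordinates to POIs is assumed to happen upstream:

```python
from collections import defaultdict
from datetime import datetime

def poi_flows(records, interval_minutes=30):
    """Count inflow/outflow per (POI, time-bin) from trip records.

    A record is (start_time, start_poi, end_time, end_poi). Outflow counts
    trips leaving a POI in a bin; inflow counts trips arriving there.
    """
    def bucket(ts):
        # Floor a timestamp to the start of its interval (e.g. 4:26 -> 4:00).
        minute = (ts.minute // interval_minutes) * interval_minutes
        return ts.replace(minute=minute, second=0, microsecond=0)

    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for start_t, start_poi, end_t, end_poi in records:
        outflow[(start_poi, bucket(start_t))] += 1
        inflow[(end_poi, bucket(end_t))] += 1
    return inflow, outflow

trips = [
    (datetime(2020, 4, 1, 4, 20), "POI_A", datetime(2020, 4, 1, 4, 26), "POI_B"),
    (datetime(2020, 4, 1, 4, 40), "POI_B", datetime(2020, 4, 1, 4, 55), "POI_A"),
]
inflow, outflow = poi_flows(trips)
print(inflow[("POI_B", datetime(2020, 4, 1, 4, 0))])  # 1
```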


Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Cen, Zhepeng, Chen, Haolin, Wang, Shiyu, Liu, Zuxin, Liu, Zhiwei, Zhao, Ding, Savarese, Silvio, Xiong, Caiming, Wang, Huan, Yao, Weiran

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
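The document-to-QA conversion described in the abstract can be pictured as a filter-and-extract pipeline. The scaffolding below is a sketch under stated assumptions: `extract_qa` stands in for the paper's LLM-based miner and `is_verifiable` for its filtering stage, neither of which is specified in the abstract:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class QAPair:
    question: str
    answer: str   # short, automatically checkable answer for RL reward
    domain: str

def build_rl_dataset(
    documents: Iterable[dict],
    extract_qa: Callable[[dict], list],
    is_verifiable: Callable[[QAPair], bool],
) -> Iterator[QAPair]:
    """Skeleton of a pre-training-document-to-QA conversion pipeline.

    Each document yields candidate pairs; only pairs whose answers can
    be verified automatically are kept as RL training data.
    """
    for doc in documents:
        for pair in extract_qa(doc):
            if is_verifiable(pair):
                yield pair

# Toy stand-ins for the LLM stages:
docs = [{"text": "The Nile is about 6650 km long.", "domain": "geography"}]
stub_extract = lambda d: [QAPair("How long is the Nile?", "about 6650 km", d["domain"])]
pairs = list(build_rl_dataset(docs, stub_extract, lambda p: len(p.answer) < 40))
print(len(pairs))  # 1
```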



DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Sun, Haoran, Bian, Haoyu, Zeng, Shaoning, Rao, Yunbo, Xu, Xu, Mei, Lin, Gou, Jianping

arXiv.org Artificial Intelligence

Constructing image datasets typically depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are clearly more valuable than artificially generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images using a multi-agent collaborative system, named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open-source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
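A dataset-construction loop of this kind can be sketched minimally. The agent roles below (optimizer, annotator, quality inspector) and their interfaces are illustrative assumptions; the abstract does not name the paper's four MLLM-backed agents:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    label: str = ""
    approved: bool = False

def dataset_agent_loop(candidates, annotate, inspect, optimize):
    """Minimal sketch of a multi-agent dataset-construction loop:
    optimize the image (tool package), label it (MLLM annotator),
    then keep it only if the inspector approves."""
    dataset = []
    for sample in candidates:
        sample.image_path = optimize(sample.image_path)
        sample.label = annotate(sample)
        if inspect(sample):
            sample.approved = True
            dataset.append(sample)
    return dataset

built = dataset_agent_loop(
    [Sample("img_001.jpg"), Sample("img_002.jpg")],
    annotate=lambda s: "cat",
    inspect=lambda s: s.label != "",
    optimize=lambda p: p,
)
print(len(built))  # 2
```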


TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

Huang, Chengrui, Gao, Shen, Shi, Zhengliang, Wang, Dongsheng, Shang, Shuo

arXiv.org Artificial Intelligence

Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.
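The error-oriented scoring idea can be illustrated with a toy scorer that counts mismatched fields in a tool call and ranks candidates into a preference pair. This is a stand-in sketch: the abstract describes ESM only as quantifying tool-call errors, so the scoring rule below is an assumption:

```python
def error_score(tool_call: dict, expected: dict) -> int:
    """Toy error-oriented score: count mismatched fields in a tool call.
    Lower is better; 0 means the call matches the reference exactly."""
    errors = 0
    if tool_call.get("name") != expected.get("name"):
        errors += 1
    got_args = tool_call.get("arguments", {})
    for key, value in expected.get("arguments", {}).items():
        if got_args.get(key) != value:
            errors += 1
    return errors

def preference_pair(calls, expected):
    """Rank candidate calls so the lowest-error one becomes 'chosen'
    and the highest-error one becomes 'rejected' for preference training."""
    ranked = sorted(calls, key=lambda c: error_score(c, expected))
    return ranked[0], ranked[-1]

expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
good = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
bad = {"name": "get_weather", "arguments": {"city": "Pari", "unit": "F"}}
chosen, rejected = preference_pair([bad, good], expected)
print(error_score(chosen, expected), error_score(rejected, expected))  # 0 2
```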


Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation

Bertina, Abbas, Beirami, Shahab, Biniazian, Hossein, Esmaeilnia, Elham, Shahi, Soheil, Pirnia, Mahdi

arXiv.org Artificial Intelligence

Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges due to its complex phonological features, particularly homographs and Ezafe, which exist in formal and informal language contexts. This paper introduces an intermediate language specifically designed for Persian language processing that addresses these challenges through a multi-faceted approach. Our methodology combines two key components: Large Language Model (LLM) prompting techniques and a specialized sequence-to-sequence machine transliteration architecture. We developed and implemented a systematic approach for constructing a comprehensive lexical database for disambiguating homographs with multiple pronunciations (often termed polyphones), utilizing formal concept analysis for semantic differentiation. We train our model using two distinct datasets: the LLM-generated dataset for formal and informal Persian and the B-Plus podcasts for informal language variants. The experimental results demonstrate superior performance compared to existing state-of-the-art approaches, particularly in handling the complexities of Persian phoneme conversion. Our model significantly improves Phoneme Error Rate (PER) metrics, establishing a new benchmark for Persian G2P conversion accuracy. This work contributes to the growing research in low-resource language processing, provides a robust solution for Persian text-to-speech systems, and demonstrates applicability beyond Persian. Specifically, the approach can extend to languages with rich homographic phenomena such as Chinese and Arabic.
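The two-stage flow (grapheme string, via a disambiguated intermediate form, to phonemes) can be sketched as below. The intermediate symbols, lexicon entries, and the sense-tag interface are all illustrative assumptions; the paper's actual intermediate language is not given in the abstract. The homograph used is the classic Persian مرد, readable as "mard" (man) or "mord" (died):

```python
# word -> {sense: intermediate form}; a toy slice of the lexical database
HOMOGRAPH_LEXICON = {
    "mrd": {"man": "m-a-r-d", "died": "m-o-r-d"},
}
# intermediate form -> phoneme string
INTERMEDIATE_TO_PHONEME = {
    "m-a-r-d": "mæɾd",
    "m-o-r-d": "moɾd",
}

def g2p(word: str, context_sense: str) -> str:
    """Map a grapheme string to phonemes via the intermediate form,
    using a sense tag (e.g. from LLM prompting or the FCA-based
    semantic classifier) to pick among homograph pronunciations."""
    senses = HOMOGRAPH_LEXICON.get(word)
    intermediate = senses[context_sense] if senses else word
    return INTERMEDIATE_TO_PHONEME.get(intermediate, intermediate)

print(g2p("mrd", "man"))   # mæɾd
print(g2p("mrd", "died"))  # moɾd
```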


Hawk: An Efficient NALM System for Accurate Low-Power Appliance Recognition

Wang, Zijian, Zhang, Xingzhou, Wang, Yifan, Peng, Xiaohui, Xu, Zhiwei

arXiv.org Artificial Intelligence

Non-intrusive Appliance Load Monitoring (NALM) aims to recognize individual appliance usage from the main meter without indoor sensors. However, existing systems struggle to balance dataset construction efficiency and event/state recognition accuracy, especially for low-power appliance recognition. This paper introduces Hawk, an efficient and accurate NALM system that operates in two stages: dataset construction and event recognition. In the dataset construction stage, we efficiently collect a balanced and diverse dataset, HawkDATA, based on balanced Gray code and enable automatic data annotation via a sampling synchronization strategy called shared perceptible time. During the event recognition stage, our algorithm integrates steady-state differential pre-processing and voting-based post-processing for accurate event recognition from the aggregate current. Experimental results show that HawkDATA takes only 1/71.5 of the collection time to collect 6.34x more appliance state combinations than the baseline. In HawkDATA and a widely used dataset, Hawk achieves an average F1 score of 93.94% for state recognition and 97.07% for event recognition, a 47.98% and 11.57% increase over SOTA algorithms. Furthermore, selected appliance subsets and the model trained from HawkDATA are deployed in two real-world scenarios with many unknown background appliances. The average F1 scores of event recognition are 96.02% and 94.76%. Hawk's source code and HawkDATA are accessible at https://github.com/WZiJ/SenSys24-Hawk.
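The Gray-code idea behind HawkDATA is that consecutive appliance-state vectors differ in exactly one bit, so only one appliance needs to be switched per measurement step. The sketch below shows the standard reflected Gray code to demonstrate that property; Hawk uses a *balanced* variant (transition counts spread evenly across bit positions), which is not reproduced here:

```python
def reflected_gray_codes(n_bits: int) -> list:
    """Standard reflected Gray code over all 2**n_bits appliance-state
    combinations; consecutive codes differ in exactly one bit."""
    return [format(i ^ (i >> 1), f"0{n_bits}b") for i in range(2 ** n_bits)]

codes = reflected_gray_codes(3)
print(codes)  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Verify: each consecutive pair flips exactly one appliance.
flips = [bin(int(a, 2) ^ int(b, 2)).count("1") for a, b in zip(codes, codes[1:])]
assert all(f == 1 for f in flips)
```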


Channel Modeling Aided Dataset Generation for AI-Enabled CSI Feedback: Advances, Challenges, and Solutions

Li, Yupeng, Li, Gang, Wen, Zirui, Han, Shuangfeng, Gao, Shijian, Liu, Guangyi, Wang, Jiangzhou

arXiv.org Artificial Intelligence

The AI-enabled autoencoder has demonstrated great potential in channel state information (CSI) feedback in frequency division duplex (FDD) multiple input multiple output (MIMO) systems. However, this method completely changes existing feedback strategies, which has made it impractical to deploy in recent years. To address this issue, this paper proposes a channel modeling aided data augmentation method based on a limited number of field channel data. Specifically, the user equipment (UE) extracts the primary stochastic parameters of the field channel data and transmits them to the base station (BS). The BS then updates the typical TR 38.901 model parameters with the extracted parameters. In this way, the updated channel model is used to generate the dataset. This strategy comprehensively considers dataset collection, model generalization, model monitoring, and so on. Simulations verify that our proposed strategy significantly improves performance compared to the benchmarks.
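The extract-then-regenerate loop can be sketched end to end. Delay spread is used here as a stand-in for the "primary stochastic parameters", and the Gaussian draw is only an illustrative assumption, not the TR 38.901 generation procedure, which parameterizes many more quantities:

```python
import random
import statistics

def extract_channel_stats(field_delay_spreads):
    """UE side: reduce field channel data to summary stochastic parameters
    (here, just the mean and std of the measured delay spread)."""
    return {
        "ds_mean": statistics.mean(field_delay_spreads),
        "ds_std": statistics.stdev(field_delay_spreads),
    }

def generate_dataset(stats, n_samples, rng):
    """BS side: plug the extracted statistics into the channel model and
    draw synthetic samples to augment the limited field data."""
    return [rng.gauss(stats["ds_mean"], stats["ds_std"]) for _ in range(n_samples)]

field = [90e-9, 110e-9, 95e-9, 105e-9]   # measured delay spreads (seconds)
stats = extract_channel_stats(field)
augmented = generate_dataset(stats, 1000, random.Random(0))
print(len(augmented))  # 1000
```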